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Abstract: This dissertation reports work where physics methods are applied to financial and economical 
problems. Some material in this thesis is based on 3 published papers 0001 which divide this study into two 
parts. The first part studies stock market data (chapter 1 to 5). The second part is devoted to personal income in 
the USA (chapter 6). 

We first study the probability distribution of stock returns at mesoscopic time lags (return horizons) ranging 
from about an hour to about a month. While at shorter microscopic time lags the distribution has power-law 
tails, for mesoscopic times the bulk of the distribution (more than 99% of the probability) follows an exponential 
law. The slope of the exponential function is determined by the variance of returns, which increases propor- 
tionally to the time lag. At longer times, the exponential law continuously evolves into Gaussian distribution. 
The exponential-to-Gaussian crossover is well described by the analytical solution of the Heston model with 
stochastic volatility. 

After characterizing the stock returns at mesoscopic time lags, we study the subordination hypothesis with one 
year of intraday data. We verify that the integrated volatility Vt constructed from the number of trades process 
can be used as a subordinator for a driftless Brownian motion. This subordination will be able to describe 
« 85% of the stock returns for intraday time lags that start at w 1 hour but are shorter than one day (upper time 
limit is restricted by the short data span of one year). We also show that the Heston model can be constructed by 
subordinating a Brownian motion with the CIR process. Finally, we show that the CIR process describes well 
enough the empirical Vt process, such that the corresponding Heston model is able to describe the log-returns 
Xt process, with approximately the maximum quality that the subordination allows (80% — 85%). 

Finally, we study the time evolution of the personal income distribution. We find that the personal income 
distribution in the USA has a well-defined two-income-class structure. The majority of population (97-99%) 
belongs to the lower income class characterized by the exponential Boltzmann-Gibbs ("thermal") distribution, 
whereas the higher income class (1-3% of population) has a Pareto power-law ("superthermal") distribution. 
By analyzing income data for 1983-2001, we show that the "thermal" part is stationary in time, save for a 
gradual increase of the effective temperature, whereas the "superthermal" tail swells and shrinks following the 
stock market. We discuss the concept of equilibrium inequality in a society, based on the principle of maximal 
entropy, and quantitatively show that it applies to the majority of population. 
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I. INTRODUCTION 

The interest of physicists in interdisciplinary research has 
been constantly growing and the area of what is today named 
socio-economical physics is 10 years old [4]. This new area 
in physics has started as an exercise in statistical mechan- 
ics, where complex behavior arises from relatively simple 
rules due to the interaction of a large number components. 



The pioneering work in the modern stream of economical 
physics was initiated by Mantegna |5] and Li |6] in the early 
nineties followed most notably by Mantegna and Stanley |7] 
and thereafter by a stream of papers |8] that attempt to iden- 
tify and characterize universal and non-universal features in 
economical data in general. This statistical mechanical mind 
frame arises in direct analogy with statistical mechanics of 
phase transitions, where materials (such as a ferromagnetic 
and a liquid), that are different in nature, can belong to the 
same universality class due to their behavior near the crit- 
ical point (point at which abruptly the phase changes, say 
from liquid to solid in water, for instance). These univer- 
sality classes are identified by critical exponents for quanti- 
ties that diverge at the critical point, for instance the specific 
heat C f» e~ a , where e is the reduced temperature and a 
the critical exponent |9]. Therefore, the area of economical 
physics has grown from, and it is still in great part concerned 
with, "power-law tails" with universal exponents. This consti- 
tutes the empirical stream of socio-economical physics, where 
modelling and characterizing the empirical data with methods 
and tools borrowed from traditional physical problems is at- 
tempted G1E0E- 

Soon after Mantegna and Li initiated the modern empirical 
stream of economical physics, simulations appeared. Once 
again, as in the case of empirical work, these were based into 
fundamental statistical mechanical models such as the Ising 
model. This literature attempted to construct from simple 
rules complex behavior that could then mimic the market and 
explain the price formation mechanism 

G3. Gil, El HQ- 

This dissertation belongs to the empirical stream of socio- 
economical physics. We study here two distinct problems. 
First, we use daily and intraday stock data to describe the 
essential nature of the stochastic process of price returns at 
different time ranges. Second, we use yearly income data to 
study the time evolution of the distribution of income in the 
USA. 



A. Stock returns 

The study of stock returns has a long history dating back 
to Bachalier in 1900, which was the first to model stock 
dynamics with a Brownian motion 11911 . He proposed that 
the absolute price change A St — St — St-u where t is 
the return horizon, should follow a Gaussian random walk. 
The clear drawback of such a hypothesis is that the prices 
of stocks could become negative. It was apparently Ren- 
ery fl9l Eol l2lll . who introduced the geometrical Brown- 
ian motion for the stock price by assuming that log-returns 
(x t = ln(Sx) - MSr-t) « AS t /S t ), and not absolute 
returns, should follow a Brownian motion. The geometric 
Brownian motion became popular and accepted as a main 
stream idea with the work of Osborne 1 22] (see also 1 19] for 
historical notes) and Samuelson (cited in 1 19J). 

It was not until the 1960's, that the hypothesis of Gaussian 
random walks was challenged by Mandelbrot 12311 and Fama 
124 1251 with studies on daily cotton prices. Since then, Brow- 
nian motion has been consistently questioned for a variety of 
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assets. Today asset log-returns that follow Brownian motion 
for all return horizons t are considered an exception. 

In his pioneering work, Mandelbrot introduced, as an alter- 
native model for stock returns, the stable Levy distribution. 
This distribution has the drawback that it can present infinite 
variance. Despite the unwanted mathematical properties that 
such a process presents, it was not founded into economical 
reasoning. In 1973 Clark 1 26] proposed, as an alternative to 
Mandelbrot's model, to use subordination l27ll to construct the 
distribution of assets returns. Subordination has a direct finan- 
cial implication, it can be liked with financial information ar- 
rival. Clark suggests that prices react to financial information 
and that if this financial information is taken into account, the 
gaussian random walk is recovered. He showed that the infor- 
mation arrival can be captured by volume of trades and that if 
one takes returns conditional on the volume, these should be 
Gaussian. 

Note that in fact, Mandelbrot and Clark do not contradict 
themselves, as Clark first implied. Mandelbrot's Levy sta- 
ble distribution can also be constructed by subordination, if 
one chooses the right subordinator for the Brownian motion. 
Therefore, the problem is reduced to finding the right subor- 
dinator if one accepts the subordination hypothesis. 

In physics, the concept of subordination can be found in 
the construction of non-Shannon entropies, in the limit of the 
continuous-time random walk, in interface growth models and 
other statistical mechanical problems [2d l2^l3(iH3lll3*3l . The 
mathematical- "physical" idea of subordination is that if the 
stochastic process is analyzed at the correct reference frame, 
it will always look like a simple gaussian diffusion. But since 
we are dealing with stochastic processes, the reference frame 
is moving randomly as well; just enough for the actual process 
in observation to be described by Brownian motion. For fur- 
ther mathematical development of subordination, see section 

Ivcl 

After Clark, the concept of subordination has been exten- 
sively used to construct asset return models I33ll34ll35ll33l . 
Most recently a series of studies have used high-frequency 
data to verify Clark's subordination hypothesis by either as- 
suming that the volume ll38l l39ll or the trading activity (num- 
ber of trades) is responsible for price changes. Strong 
evidence is found for both; nonetheless number of trades ap- 
pears better suited, since it has been extensively tested for a 
large number of companies |4j]]. 

Contemporary to Clark, a series of empirical studies indi- 
cated that the variance (variance — volatility 2 ) of stock re- 
turns is not constant (see i43ll and references therein). This 
resulted in models for stock returns such as Engle's ARCH 
and Bollerslev's GARCH that attempted to account for the 
changing variance in the assets returns by modelling both in 
a discrete framework |44|. At the same time, models with 
stochastic volatility were introduced. These models gener- 
ally assume a mean reverting continuous stochastic differen- 
tial equation for the volatility |45l EH ES Notice that 
stochastic volatility models, GARCH and subordination, are 
not entirely orthogonal to each other. Stochastic volatility 
models can also be constructed by subordination IB7I1 (see also 
section fVCl or as limits of discrete GARCH type models 1 47 ] . 



In 1993 Heston |48] introduced an exactly solvable stochas- 
tic volatility model that is also a limit process for the 
GARCH(1,1) model 0. The Heston model become widely 
used for option pricing and in the study of asset returns. We 
use a modified version of the Heston model as developed in 
Ref. |§-9] to describe the general shape of probability density 
distribution (PDF) for the log-returns and the time evolution 
of such PDF. 



B. Outline of the dissertation 

The outline of this thesis is as follows. In chapter |n] we 
introduce the Heston model for stock returns as developed 
in Refs. l49Tl . We summarize the procedure for finding 
the closed form solution of the probability distribution for the 
log-returns, starting from the correlated stochastic differential 
equations as given in Ref. 14911 . We also introduce subordi- 
nation and show how to construct the Heston model using a 
Cox-Ingersoll-Ross (CIR) subordinator ll7lll . 

In chapter ITTT1 we present the data we use in this thesis. 
We show the typical features of the stock data and how we 
constructed such data. 

In chapter ffVl we study the time evolution of the empiri- 
cal distribution function (EDF) for the stock returns at meso- 
scopic time lags t (1 hour < t < 20 days). We show that in 
the short-time limit t < < 1/7, the EDF progressively tends to 
the double exponential distribution and for the long-time limit 
t > > 1 /7, the EDFs progressively tends towards a Gaussian, 
where 1/7 is the characteristic time for such limits. Further- 
more, we show that the Heston model introduced in chapterllTI 
presents these fundamental features. 

In chapter|V] we study the hypothesis of subordination. We 
first start by pointing out the effect of the discrete nature of ab- 
solute price changes in the log-returns. Thereafter, we verify 
the subordination hypothesis using both tick-by-tick data (this 
data records all trades in a given day, see chapter FTTH as well 
as 5 minutes log-returns and number of trades (ticks) data. We 
find that if we use the integrated variance (Vt), which is pro- 
portional to the number of trades (N t ), as our subordinator, 
we are able to explain approximately the central 85% of the 
probability distribution for the log-returns xt between 1 hour 
and 1 day. Finally, we show the quality of modelling the sub- 
ordinator Vt with the CIR process introduced in section IvTI 
and discuss the implication of such model for the log-returns 
x t . 

The last chapter of this thesis presents work on the time 
evolution of the distribution of income. We show the evolu- 
tion of the distribution of personal income in the United States 
from 1983 to 2001. We show that the bulk of the distribution 
(excluding very small income and very large income), is de- 
scribed by the Exponential distribution with average income 
changing from year to year in approximately the same rate as 
inflation. We conclude that the inflation-discounted income 
of the majority of the population is approximately the same 
throughout time and therefore well approximated by a system 
in thermal equilibrium. We also show that the top 3% earn- 
ers have income that changes over time even when inflation 
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is accounted for. This chapter is self contained and does not 
require any other part of the thesis to be read. 



II. HESTON MODEL FOR ASSET RETURNS 



In order to solve Q and (|2ji together with 0, we first 
change variables from stock price St to mean removed (de- 
mean) log-return xt — hi(St/So) — fit ©. All further results 
and solutions are constructed for the demean log-return xt, 
which we will simply refer to as log-return or return: 



The Heston model was introduced by Heston |48] and be- 
longs to the class of stochastic volatility models, which have 
received a great deal of attention in the financial literature spe- 
cially in connection with option pricing 1 45 ] . 

Empirical verification of the Heston model was done for 
both stocks 01 1 El |H H and options EH EH EH, 
and good agreement with the data has been found in these 
studies. The version of the Heston model for stock returns 
used in 

in ism, 

as well as in this thesis, was modified from 
the original solution by Heston and has evolved into a different 
formula with 3 parameters. One parameter for the variance 
(9), one parameter representing the characteristic relaxation 
time to the Gaussian distribution (I/7) and another that gives 
the general shape of the curve (a). 

The outline of this chapter is as follows. First, we present 
the modified Heston model used in this work by showing its 
evolution from solving the related stochastic differential equa- 
tions (SDE). Thereafter, we introduce subordination and we 
show the development of the modified Heston model through 
subordination. 



dx t = dt + y/v^dWt 



(i) 



(4) 



After performing the change of variables from price to re- 
turn, we solve the Fokker-Planck equation 0162] implied by 
SDEs (0 and @, for the transition probability Pt(x, v \ Vi) to 
find the return x and the volatility v at time t given the initial 
demean log-return x = and variance u,; at t = 0. For sim- 
plicity, we drop the explicit time dependence notation for the 
returns xt and call them x. 



d_ 

dt' 



P 



9 u 



6)P] 



1 d 



i:) 2 



2dx 
1 d 2 



(vP) 



(5) 



The general analytical solution of l|5} for Pt(x, v\vi) with 
initial condition P t= o{x, v\ Uj) = 6(x)5(v — Vj) can be found 
by taking a Fourier transform x— > p x and a Laplace trans- 
form v— > p v (see l49ll for details), 



A. Heston model-SDE and symmetrization 

The formal way of presenting the Heston model is given by 
two stochastic differential equations (SDE), one for the stock 
price St and another for the variance v% ■ 



+00 

Pt{x\vi) = J dv P t (x,v\vi) 
"0 



dp 



2n 



(6) 

where the hidden variable v is integrated out, so p v = 0. 
Therefore we have 



dSt = fJtSt dt + at St dW t 



(i) 



dv t — —^{vt — 0)dt + Ky/v^dWt 



(2) 



(1) 



(2) 



where the subscript t indicates time dependence, fi is the drift 
parameter, and are standard random Wiener pro- 
cesses, cr t is the time-dependent volatility and v t — of is the 
variance. In general, the Wiener process in (0 may be corre- 
lated with the Wiener process in Q: 



(2) 



pdW t 



(i) 



yJ\-p 2 dZ u 



(3) 



where Z t is a Wiener process independent of W t , and 
p e [— 1 , 1] is the correlation coefficient. Note that Q and 
are well known in finance. These represent, respectively, 
the log-normal geometric Brownian motion stock process in- 
troduced by Renery, Osborne and Samuelson 111^1 (used by 
Black-Melton-Scholes (BMS) j6l0 for option pricing. See 
Ref. 16911 for a practical application of BMS to physics) and 
the Cox-Ingersoll-Ross (CIR) mean-reverting SDE first intro- 
duced for interest rate models l7lL 17211. 



Pt(x I Vi 



dp x 
2tt 



ipxx r +n coth (nt/2) 



X e 



-2^ lnfcosh + £ sinh SM.) 



■yret 



where 



and 



7 + ipnp x 



n= y/T 2 + K 2 (p 2 x -ip x ). 



(7) 



(8) 



(9) 



The marginal probability density P t (x \ Vi) could then be 
compared to empirical stock returns directly. Nevertheless, Vi 
has to be treated as an extra parameter. In order to avoid this, 
we assume that Vi has the stationary distribution of the CIR 
stochastic differential equation (0, n*(u), 



n,(«) 



a v 



a-l 



-av/9 



T(a) 8 a 



a 



2j6 



(10) 



Using equation dlOl we arrive at the probability distribution 
of the demean log-returns Pt(x), 
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Ft (a?) = / dv i n*(v l )Pt{x\v l ) 



(11) 



where the final solution is 



+ 30 



with 



F t ( Px ) = l T rt 



(12) 



(13) 



2 7 6» 



In 



where as before 



nt n 2 - r 2 + 2 7 r , m 
cosh T + 2^ slnh T 



r = 7 + ipKPa; 



and 



(14) 



(15) 



The operation of removing the initial volatility dependence 
of the marginal probability density Fs(x | Vi) using equation 
(1111 was first introduced in Ref. 14911 . This removes an ad- 
ditional degree of freedom and therefore simplifies the final 
marginal probability density. 

In order to further simplify the original Heston model, we 
assume that equations ([0 and are uncorrected. That 
amounts in taking p = in expression dl31 . This approxi- 
mation was shown to be acceptable for some companies and 
indexes in the US market [1, 2, 49] but might not be good for 
different markets l64ll or for option pricing EslEiUl . 

In order to arrive at the probability density function used 
in this work, we need to further simplify the equation for 
Pt(x, p = 0) J! 2b into a zero skew symmetrical function. 

We replace in (I12> p x — > p x + i/2 and p = to find 



Pt{x) 



2tt 



at 



F i(Px) = y-aln 



nt n 2 + i , nt 
cosh T + ^T smh T 



t = jt, a = 2j9/k 



.2 



fi = y/l + {p X Khy 



et. 



(19) 
(20) 

(21) 



We have expressed the original Heston model for the prob- 
ability density of log-returns x, in a highly symmetrical form 
with three parameters, 6, a and 7. The parameter 9 can 
be found by calculating the variance of demean log-returns 
of = (x 2 ) = Ot Mil of Pt{x) \\9\ . The remaining two pa- 
rameters, a and 7, are responsible for the general shape of the 
curve and the relaxation rate of P t (x) to a Gaussian distribu- 
tion l49ll . The parameter a is also responsible to define the 
analyticity at zero return. If a = 1, value used in this thesis, 
the short-time-limit is a double exponential distribution (see 
next subsection). This distribution is not analytical at zero 
but becomes when time progresses. For a > 1 the distribu- 
tion is always analytical with a center that is Gaussian and 
when a < 1 the distribution starts non-analytic at zero (going 
to zero as a power-law with exponent 2a — 1 [49]) and then 
evolves into a analytic distribution with Gaussian center. 

Notice that the average for the log-returns x from equa- 
tion dl9l is (a;) = 0. This average is not consistent with 

SDE (0}, but with the simplified dxt — -JvtdWj , where 
the drift term v t /2 is set to zero. Therefore, x in equation 
H9\ does only approximately represent demean log-returns 
x = ln(St/So) — pt. This difference arises because we took 
g-z/2 ^0 1 anc [ p% _|_ ^4 ~ ^2 j n e q Ua ];i on J17| in order to 

derive equation j20t . 

The log-returns x in equation dl9t can be exactly given by 
x = ln(St/So) — pt — u>(t), where the extra term, oj(t), re- 
moves the non zero average of x — ln(St/Sa) — pt- 

The extra term u>(t) arises because the average of the stock 
price at time t needs to be given by p only. Hence 



+co dp 



P t (x) = e-*/ 2 / ^ e ^+ F *W (16) 
2tt 



where a — 2^9/ k 2 , 



(S t ) = S e^{e Y *), (e Y *) = l, 
where Y t is the stochastic process 



(22) 



Ft(p x ) 



and 



ajt 
~2~~ 



-a In 



cosh ■ 



nt 



n 2 + 7 2 

2 7 ft 



sinh ■ 



fit 



(17) 



fi = vV + « 2 (P^ + 1/4) ~ 7Vl + P'(« 2 /7 2 )- (18) 

Finally, we drop the e -1 / 2 term in d!6i . Notice that both 
taking e~ x l 2 » 1 and p 2 + 1/4 p 2 are needed to produce a 
new characteristic function e F *( px ' that correctly goes to unity 
when p x = 0. The final functional form for P t (x) is 



St — So 



Soe fJ.t-ln(<e x t>)+X t 



< e Xt > 

uj{t) = -ln{< e Xt >) 
x t = ln(S t ) — ln(So) — pt = X t + u>(t) 

=*Y t = X t + u(t). (23) 

Empirically, the correction represented by uj(t) or by work- 
ing with equation dl6> instead of equation dl9l is small, and 
it can be safely neglected. We choose to work with x = 
ln(St/So) — pt — uo{t), and we call x in ( 1191 the log-return. 
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1. Short and long time limits of the Heston model 

The short time lag limit of the modified Heston model dl9l 
can be found by assuming ilt <C 2 in expression @. We also 
take ,o = and ip x — > 0, since we interested in the short- 
time-limit of the symmetric modified Heston model of equa- 
tion H9\ . When taking the limit fit -C 2 in 0, the resulting 
PDF is the Fourier inverse of the characteristic function of a 
Gaussian with random variance Uj and zero drift. Since Vi is a 
Gamma random variable with distribution dlOi . the final char- 
acteristic function for the short-time-limit distribution of the 
modified Heston model is 



Pt(px) = / d^e-^II.^) = (1 + ^P)- a . (24) 
The probability distribution can be found analytically |49] 



) 1 — Q 



f- 1/2 K a _ 1/2 (y) 



T(a) V tt^* 
where A' is the modified Bessel function and 



(25) 



y 



2ax 2 

et ' 



(26) 



For a = 1, we recover the Laplace (symmetrical double 
exponential) distribution 



Pt{x) 



--, y 



2ax 2 

et ' 



(27) 



NIG is part of the wide class of Levy pure jump models 
ll33ll . and the fact that it is recovered as a limit of the simpli- 
fied Heston stochastic volatility model (1191 . is another conse- 
quence of the randomization of Vi. Notice that if we take the 
long time limit before the randomization of Vi in the full He- 
ston model given in Eq. 0, we will not find NIG as the long 
time limit. 

The central limit theorem can be invoked for NIG and there- 
fore for Heston 1151 l27l D5l 14911 . That is, as time progresses, 
the distribution Pt(x) of returns x will become increasingly 
Gaussian. The characteristic time scale for the central limit 
theorem to act is to = 2/ (07). For t ^> to the probability dis- 
tribution is essentially Normal with mean zero and variance 

et. 

Notice that for long time lags t, there are two characteristic 
time limits. Heston tends to NIG for times t > 1/7 and then 
NIG tends to a Normal distribution for times t 3> l/cry. If 
a > 1, NIG and Heston regimes can not be effectively dis- 
tinguished. It is only in the case a < 1, that there will be a 
distinguished NIG regime. 

In summary, the most important limits for Pt(x) that we 
use in this study are: Exponential (if a = 1) at short time lags 
and Gaussian at long time lags, 



Pt{x) 



exp(-|x|^/270i), 
cx P (-.t 2 /26>£), 



t = jt < 1, 
t = 7 <> 1. 



(29) 



B. Heston model and subordination 

Subordination is a form of randomization in which one con- 
structs a new probability distribution, by assuming one or 
more parameters of the original probability distribution to be 
random 11231, 



Notice that the short time limit is not a Gaussian with vari- 
ance Vi, only because of the assumed randomization of Uj i24l . 
Therefore, this randomization has substantial effect in the lim- 
iting distributions, which can be checked empirically 1 2| (em- 
pirical results will be presented in chapter llVt . 

The long time lag t limit for the modified Heston model 
can be found by taking the limit fit ^> 2 in the characteristic 
function J20I . The resulting characteristic function is 



Pt(Px 



(28) 



The characteristic function in equation (12 8 1 is the charac- 
teristic function for the zero skew Normal Inverse Gaussian 
(NIG) model. NIG was first introduced by Barndorff-Nielsen 
to describe the distribution of sand particles sizes l73ll and was 
subsequently used in other physical problems such as turbu- 
lence 17411 . In 1995, Barndorff-Nielsen also introduced NIG 
for stock returns 13511 . NIG can also be obtained as a limit 
of the Generalized Hyperbolic distribution ll33l l75ll . as well 
as by subordinating a Brownian motion to the inverse gaus- 
sian distribution 1 33] (next section will introduce the idea of 
subordination). 



PNew(y,z) 



d9P(y, 



(30) 



In the case of subordination, a Markov process Y(N) 
is randomized by introducing a non-negative process N(t), 
called a randomized operational time. The resulting process 
Y(N(t)) does not need to be Markovian in general 12711 . We 
restrict ourselves to subordination of a Brownian motion with 
drift 9 and standard deviation a (13 1 1 . We also assume in what 
follows, that t is time lag in usual units of time, unless oth- 
erwise indicated. The probability density Pt(y) for the time 
changed Brownian motion Y(N) can be written 



Pt{y) 



dN- 



1 



V2na 2 N 



-(y-BN) A 

e 2^n P t (N). 



(31) 



The moments of a Brownian subordinated process are re- 
lated to the moments of the subordinator. If we use Pt (y) in 
13 H. the first 4 moments can be calculated as 



(y) = 9(N) N 



(32) 
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at 



((y - (y)) 2 ) = a 2 (N) N + 9 2 ((N - (N) N ) 2 ) N (33) 



((y- (y)f) = 3a 2 6{(N- (N) N ) 2 ) N + 9 3 ((N- (N) N f) N 

(34) 

((y - (y)) 4 ) = 3a 4 «(iV - (iVMV + (N) 2 N ) + 
69 2 a 2 (((N-(N) N f) N + 
{N) N {(N - (N) N ) 2 ) N ) + 9 4 ((N - (N) N ) 4 ) N , (35) 

where () refers to taking the expected value and () n refers to 
taking the expected value with respect to N. The time t depen- 
dence of the moments of Y are given by the moments of the 
randomized operational time N. Furthermore, even though 
the subordinator has odd moments, odd moments in the re- 
sulting process Y are only different from zero, if the Gaussian 
in equation < I3 1 1 has a drift 9^0. For the present work, we 
assume that the odd moments are all zero since the empirical 
probability distribution of log-returns are quite well described 
by zero skew probability distributions and because we work 
with mean zero returns Q|. By assuming zero odd moments 
probability distribution, we simplify the even moments. The 
second and fourth moments for Y depend only on the first and 
second moments of the subordinator N <33l35t . 

In the case of the modified Heston model dl9l >. the subor- 
dination takes the following terms. We assume that the log- 
returns x follow a Brownian motion with zero drift and vari- 
ance Vt- The variance Vt is our "random operational time", 
since it changes randomly. We will show in chapter M that 
the variance Vt can be estimated (at least partially) using the 
number of trades N t that occur in a the time interval t. The 
variance Vt is then a constant times Nt, Vt = <J 2 N t . 

The variance Vt is given by Vt — L dsv s , where the in- 
stantaneous variance v t appearing in the SDE (|2ji is integrated 
in the interval — ► t. For this reason, Vt is also know as in- 
tegrated variance. The Laplace transform for the conditional 
probability density -Pt(Vj| Vi) is analytically known l33ll7lll . 
Therefore, subordination becomes a useful tool to construct 
asset models with stochastic variance having the CIR process 
as a subordinator 1 3J$ . 

The Laplace transform of the subordinator of the modified 
Heston model i20\ can be read off immediately, 



P{Px 



P(p x ) = / dV t 



— P(Vt) (36) 



where the integral with respect to Vt defines a Laplace trans- 
form of the probability density P(Vt), for which the Laplace 
conjugated variable is calculated at p 2 ,/2. Therefore we arrive 
at 



Pt{V t )= J dp 



v t e 



pi/ t x+_F t -(p Vt ) 



(37) 



nt n 2 + i . , nt 

cosh 1 — — smn — 

2 2n 2 



F i(pv t ) = y - a In 
i = 7t, a = 2j9/k 2 , n = v/l + 2(K/7) 2 py t . (39) 



(38) 



The only difference between the characteristic exponent 
(I38i and the characteristic exponent for the Heston model (I20> 
is in n, where py t replaces p 2 /2 as the Laplace variable for 
V t . 

The first and second moments for the integrated CIR pro- 
cess J38i are 



(V t ) = 9t 
((Vt - (V)) 2 ) 



202 

«7 2 



(e 



-it 



l + 7 <). 



(40) 
(41) 



The time dependence of the variance J41I shows that the 
CIR process is not independent and identically distributed 
(IID). That is expected since we have a mean reverting SDE 
for the instantaneous variance v t with exponential relax- 
ation to the mean 

We have shown that subordinating a zero drift gaussian to 
the integrated V t , given by equation ( I37t is equivalent to solv- 
ing for the transition probability densities for the uncorrelated 
(p = in equation 0) system of SDEs dxt = ^/vtdW^ 

and dv t — —~f(vt — 9)dt + Ky/vtdW^ ©. However, it is not 
clear how to use subordination in order to produce a stochastic 
process that is equivalent to the correlated (p ^ 0) system of 
SDEsJU. 



HI. GENERAL CHARACTERISTICS OF THE DATA AND 
METHODS 

We use 2 databases for this study. Daily closing prices 
are downloaded from Yahoo l50ll and intraday data is con- 
structed using the TAQ database from the NYSE [51]. The 
TAQ database records every transaction that occurred in the 
market (tick-by-tick data), where the average number of trans- 
actions in a day for a highly traded stock, such as Intel, is 
20000 (from 1993 to 2001). That is equivalent, in terms of 
data quantity, to approximately 77 years of daily data. 

Our data has the time that the transaction occurred, the price 
the transaction was realized and the volume of the transaction 
(number of shares that exchanged hands). The TAQ database 
does not account for splits or dividends whereas Yahoo gives 
the prices corrected for splits and dividends. However we 
do need to correct for splits and dividends because the TAQ 
database is used only when constructing intraday returns. The 
splits and dividends are realized overnight and therefore will 
not show up if we calculate intraday returns. 

After downloading the TAQ data, we remove any trade that 
is recorded as an error and also restrict the data to trades that 
took place inside the conventional 6.5 hours trading day from 
9: 30 AM to 4: 00 PM. Any trade that happen before 9: 30 AM 
and after 4: 00 PM is ignored. We choose to restrict to busi- 
ness hours because we want our data set to agree with Yahoo 
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FIG. 1: Intraday stock price and number of trades constructed from 
the TAQ database at each 5 minute interval from Thursday, 2nd 
of January 1997 to Thursday 9th of January 1997 for Intel (upper 
panel). Volume of trades during each day is shown in the lower panel. 
Days are separated by an effective overnight time interval that is con- 
structed from the data, such that the open-to-close variance and the 
close-to-close variance of the log-returns follow the same oc t line 
(see Fig. [5}- 



daily data in the limit of one day that is defined from the open 
bell (9: 30 AM) to close bell (4: 00 PM). 

We define as the daily open price, the price of the first trade 
that happened after or at 9: 30 AM. We also define the daily 
close price, the price of the trade that happened right before or 
at 4: 00 PM. A typical time series for intraday prices, number 
of trades and volumes for 1 particular week is shown in Fig. 
□ 

Notice that the intraday volume and trading activity (num- 
ber of trades) can be well described by a parabola (Fig. QJ. 
This typical intraday pattern l52l l53ll has also been found 
for high-frequency volatility proxies, such as the root mean 
square return for all ticks that happen in a certain interval of 
time 1 54 l55l Pa 1571 l58l 15911 . The statistics for such a pattern 
for the number of trades of Intel in the year 1997 is shown in 
Fig. [3] Notice that the probability density for different parts of 
the day will clearly have different widths and averages. There- 
fore, mixing all parts of the day will result in a wider probabil- 
ity density for number of trades and other intraday quantities 
15311 . We do not study the consequences of such a mixture, we 
only are careful to work with intraday time lags that divide 
equally all day |2]. In such a way, all parts of the daily trend 
are equally represented. Since we are working with prices 
quoted at every 5 minutes (Five minutes close prices) and the 
day from open to close has only 78 such intervals, we work 
with returns that are t = 5, 10, 15, 30, 65, 130, 195, 390 min- 
utes long. 

Another important characteristic of daily and intraday data 
is shown in Fig. [2] The cumulative number of trades from 
1993 to 2001 (X)i=oi/oi/i993 Ni) increase almost exponen- 
tially. The behavior of the commutative number of trades 
shows that the average number of trades change from year to 
year. The same type of behavior is found for the square of the 
demean log-returns (the variance of the returns). Therefore, 
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FIG. 2: Cumulative number of trades and return from 1993 to 2001 
for Intel. The increase of the cumulative number of trades indicate 
that the parameters describing the stock are changing. 

INTC, 1997 
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FIG. 3: Average number of trades (ticks) in a given period of the 
day. The error bars represent the volatility. The red solid line gives 
the best fit parabola to the average number of trades. Same type of 
pattern is found for absolute returns 1 54] and volume. 



the probability density for the returns, volume and number of 
trades is only approximately stationary throughout the years. 
When studying returns (chapter HVl . we assume the data as 
stationary, and we take data from 1993 to 1999. When study- 
ing subordination using the number of trades (chapter[V), we 
reduce the non-stationary effect of the data by working with 
one year of data. 

In order to study intraday returns, we construct from the 
tick-by-tick data, 5 minutes close prices. The 5 minute close 
price is defined in analogy with the day close price. The 5 
minutes volume ( or number of trades (ticks)) is the sum of 
all traded volume ( or number of trades (ticks)) in a 5 minutes 
interval. 

When constructing intraday returns time series, we do not 
include nights or weekends. Effectively our largest intraday 
return is from open to close (time lag of 390 min = 6.5 hours). 
A common procedure, not adopted here, is to assume the open 
of the next day as the close of the present day 16(1 16111 . This 
will include returns that are effectively overnight, where no 



9 



INTC, 1997 



E 



— x with overnight //H\\ 

— x without overnight / // \\V 


\, 1 :05 hours 


j/ / 15 mlq' / \\ 




5 min 









l -l , , , , , , , 

—0.04 -0.03 -0.02 -0.01 0.01 0.02 0.03 0.04 
Log-return, x 



FIG. 4: Cumulative density function for the positive and negative 
log-returns of Intel. Log-returns constructed including overnight 
time lags (solid lines) show higher probability of large returns than 
log-returns that do not include overnight time lags (dashed lines). We 
choose not to include overnight time lags in our intraday return time 
series. 



trades are present. The result of such practice is illustrated in 
Fig. @] Clearly, the tails of the distribution of returns including 
overnight time lags are considerably enhanced, if compared 
with the distribution of intraday returns that do not include 
overnight time lags. 

When working with high-frequency (intraday) data record- 
ing errors are inevitable. In order to remove errors in the 
tick-by-tick data as well as our 5 minutes close time series, 
created from the tick-by-tick data, we use Yahoo database as 
our benchmark. We assume that the daily Yahoo database 
does not have errors. Our filtering technique consists of 
two parts. First, we calculate the log-return between the 
maximum and minimum price of a given day for the Ya- 
hoo data (xhl)- We then calculate the log-return {r§ m i n — 
Iti(St) — ln(Sr-5min)) f° r me 5 minutes price data in the 
same day and compare to thl- We replace any log -return 
\ft\ > thl with the return immediately preceding it. We 
also replace the number of trades and volume of the "cor- 
rupted" 5 minute interval by the immediately preceding ones. 
The second filtering procedure consists of requiring that the 
largest and smallest 5 minutes log-return (r^ m i n ) in a given 
day, be between the maximum and the minimum of all the 
time series formed by the yahoo open to close return data 
(min{roc) < r 5min < max(r c))- Once again, if the 
condition min(roc) < '"Smin < rnax(roc) is not satisfied, 
we replace the "corrupted" log-return, volume and number of 
trades by the immediately preceding one. 

The typical effect of such a simple error removal algorithm 
is to change less than 1% (on the order of 0.1%) of the data. 

The same filtering procedure is used for tick-by-tick data, 
except that instead of replacing the "corrupted" log-return and 
volume, we just ignore it. In fact ignoring or replacing by the 
nearest value is found to be equivalent (for tick-by-tick or 5 
minutes data) for the purpose of this work: the probability 
density and moments are the same. 



IV. MESOSCOPIC RETURNS 

The actual observed empirical probability distribution func- 
tions (EDFs) for different assets have been extensively stud- 
ied in recent years 

CI E3 M M M M Ez3 EE Ez3 M M . 

We focus here on the EDFs of the returns of individual large 
American companies from 1993 to 1999, a period without ma- 
jor market disturbances. By 'return' we always mean 'log- 
return', the difference of the logarithms of prices at two times 
separated by a time lag t. 

The time lag t is an important parameter: the EDFs evolve 
with this parameter. At micro lags (typically shorter than one 
hour), effects such as the discreteness of prices and transac- 
tion times, correlations between successive transactions, and 
fluctuations in trading rates become important (for discrete- 
ness effects see chapter |VJ|1 5, d. Power-law tails of EDFs 
in this regime have been much discussed in the literature be- 
fore 1 60, 61]. At 'meso' time lags (typically from an hour to 
a month), continuum approximations can be made, and some 
sort of diffusion process is plausible, eventually leading to a 
normal Gaussian distribution. On the other hand, at 'macro' 
time lags, the changes in the mean market drifts and macroe- 
conomic 'convection' effects can become important, so sim- 
ple results are less likely to be obtained. The boundaries be- 
tween these domains to an extent depend on the stock, the 
market where it is traded, and the epoch. The micro-meso 
boundary can be defined as the time lag above which power- 
law tails constitute a very small part of the EDF The meso- 
macro boundary is more tentative, since statistical data at long 
time lags become sparse. 

T he fi rst result is that we extend to meso time lags a stylized 
fact l 1 24ll known since the 19th century |82] (quoted in fiHl ): 
with a careful definition of time lag t, the variance of returns 
is proportional to t. 

The second result is that log-linear plots of the EDFs show 
prominent straight-line (tent-shape) character, i.e. the bulk 
(about 99%) of the probability distribution of log-return fol- 
lows an exponential law. The exponential law applies to the 
central part of EDFs, i.e. not too big log-returns. For the far 
tails of EDFs, usually associated with power laws at micro 
time lags, we do not have enough statistically reliable data 
points at meso lags to make a definite conclusion. Exponen- 
tial distributions have been reported for some world markets 
Q El]6l 00 El HI 111 and briefly mentioned in the 
book 1 15] (see Fig. 2.12). However, the exponential law has 
not yet achieved the status of a stylized fact. Perhaps this is 
because influential work 1 60, 61] has been interpreted as find- 
ing that the individual returns of all the major US stocks for 
micro to macro time lags have the same power law EDFs, if 
they are rescaled by the volatility. 

The Heston model is a plausible diffusion model with 
stochastic volatility, which reproduces the timelag-variance 
proportionality and the crossover from exponential distribu- 
tion to Gaussian. This model was first introduced by Heston, 
who studied option prices 1 4-8] . Later Dragulescu and Yako- 
venko (DY) derived a convenient closed-form expression for 
the probability distribution of returns in this model and ap- 
plied it to stock indexes from 1 day to 1 year |49]. The third 
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FIG. 5: Top panel: Variance (xl) vs. time lag t. Solid lines: Linear 
fits {xl) = 8t. Inset: Variances for MRK before adjustment for 
the effective overnight time T„. Bottom panel: Log-linear plots of 
CDFs vs. x/Voi. Straight dashed lines —\x\^2/8t are predicted by 
the DY formula \29\ in the short-time limit. The curves are offset by 
a factor of 10. 



result is that the DY formula with three lag-independent pa- 
rameters reasonably fits the time evolution of EDFs at meso 
lags. 



A. Data analysis and discussion 

We analyzed the data from Jan/1993 to Jan/2000 for 27 
Dow companies, but show results only for four large cap com- 
panies: Intel (INTC) and Microsoft (MSFT) traded at NAS- 
DAQ, and IBM and Merck (MRK) traded at NYSE (please 
see the appendix for more companies). We use two databases, 
TAQ to construct the intraday returns and Yahoo database for 
the interday returns (see Chapter llJll i. The intraday time lags 
were chosen at multiples of 5 minutes, which divide exactly 
the 6.5 hours (390 minuteslof the trading day. The interday 
returns are as described in 111 14911 for time lags from 1 day to 
1 month = 20 trading days. 

In order to connect the interday and intraday data, we have 
to introduce an effective overnight time lag T n . Without 
this correction, the open-to-close and close-to-close variances 
would have a discontinuous jump at 1 day, as shown in the in- 
set of the left panel of Fig. [5] By taking the open-to-close time 
to be 6.5 hours, and the close-to-close time to be 6.5 hours + 
T„, we find that variance {xf) is proportional to time i, as 
shown in the left panel of Fig. [5] The slope gives us the He- 
ston parameter 8 in Eq. i l2 It . T n is about 2 hours (see Table 

ID 



FIG. 6: Top panel: Theoretical CDFs for the Heston model plotted 
vs. x/y/Wi. The curves interpolate between the short-time exponen- 
tial and long-time Gaussian scalings. Bottom panel: Comparison 
between empirical (points) and the DY theoretical (curves) charac- 
teristic functions Pt(k). 

In the right panel of Fig. [5] we show the log-linear plots of 
the cumulative distribution functions (CDFs) vs. normalized 
return x/V0~i. The CDF t (:r) is defined as Pt(x') dx' , 
and we show CDF t (x) for x < and 1 — CDF t (x) for x > 0. 
We observe that CDFs for different time lags t collapse on 
a single straight line without any further fitting (the parame- 
ter is taken from the fit in the left panel). More than 99% 
of the probability in the central part of the tent-shape distri- 
bution function is well described by the exponential function. 
Moreover, the collapsed CDF curves agree with the DY for- 
mula d29l Pt{x) oc exp(— \x\\/2/6t) in the short-time limit 
for a = 1 114911 . which is shown by the dashed lines. 

TABLE I: Fitting parameters of the Heston model with a = 1 for the 
1993-1999 data. 





7 1/7 » T n 
t^— hour — hour 

hour year year 


INTC 


1.029 0:58 13.04% 39.8% 2:21 


IBM 


0.096 10:25 9.63% 35.3% 2:16 


MRK 


0.554 1:48 6.57% 29.4% 1:51 


MSFT 


1.284 0:47 9.06% 48.3% 1:25 



Because the parameter 7 drops out of the asymptotic Eq. 
( I29> . it can be determined only from the crossover regime 
between short and long times, which is illustrated in the left 
panel of Fig. [6] We determine 7 by fitting the characteristic 
function Pt(k), a Fourier transform of Pt(x) with respect to 
x. The theoretical characteristic function of the Heston model 
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FIG. 7: Comparison between the 1993-1999 Intel data (points) and 
the DY formula 1201 (curves) for PDF (top panel) and CDF (bottom 
panel). 



B. Conclusions 

We have shown that in the mesoscopic range of time lags, 
the probability distribution of financial returns interpolates be- 
tween exponential and Gaussian law. The time range where 
the distribution is exponential depends on a particular com- 
pany, but it is typically between an hour and few days. Sim- 
ilar exponential distributions have been reported for the In- 
dian JF7|, Japanese ll78ll . German Inland Brazilian markets 
Hlmas well as for the US market HlEollsill (see also He. 
2.12 in fTl l. The DY formula jH for the Heston model H 
captures the main features of the probability distribution of re- 
turns from an hour to a month with a single set of parameters. 

V. NUMBER OF TRADES AND SUBORDINATION 

The concept of subordination has important fundamental 
and practical implications. From a fundamental point of view, 
it gives a relation between microstructure of the market and 
price formation that can be exploited in simulations and mod- 
elling 1421 l55[ l84 18511 . From a practical point of view, the 
subordinator can be identified with the integrated variance V t 
115 61 18611 . This would imply a direct measure of the mean 
square return which could impact pricing and hedging both 
of options on a particular stock as well as variance swaps and 



is Pt{k) — e F iW i20\ . The empirical characteristic functions 
(ECFs) can be constructed from the data series by taking the 



sum Pt(k) = ReJfJx exp(— ikxt) over all returns xt for a 
given t 



1831 



Fits of ECFs to the DY formula \20\ are shown 
in the right panel of Fig.|6] The parameters determined from 
the fits are given in Table|I] 

In the left panel of Fig. [7] we compare the empirical PDF 
Pt (x) with the DY formula (120b . The agreement is quite good, 
except for the very short time lag of 5 minutes, where the tails 
are visibly fatter than exponential. In order to make a more de- 
tailed comparison, we show the empirical CDFs (points) with 
the theoretical DY formula (lines) in the right panel of Fig. 
We see that, for micro time lags of the order of 5 minutes, 
the power-law tails are significant. However, for meso time 
lags, the CDFs fall onto straight lines in the log-linear plot, 
indicating exponential law. For even longer time lags, they 
evolve into the Gaussian distribution in agreement with the 
DY formula J20i for the Heston model. To illustrate the point 
further, we compare empirical and theoretical data for several 
other companies in Fig.|8] 

In the empirical CDF plots, we actually show the ranking 
plots of log-returns xt for a given t. So, each point in the 
plot represents a single instance of price change. Thus, the 
last one or two dozens of the points at the far tail of each plot 
constitute a statistically small group and show large amount 
of noise. Statistically reliable conclusions can be made only 
about the central part of the distribution, where the points are 
dense, but not about the far tails. 
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FIG. 8: Comparison between empirical data (symbols) and the DY 
formula 1201 (lines) for CDF (left panels) and characteristic function 
(right panels). 



12 



options on the variance. 

In this chapter we verify and model the subordination hy- 
pothesis as given by Eq. i36\ . We will restrict our study to 
intraday Intel data in the year 1997. We restrict to a year of 
data because of the nonlinear drift of the number of trades: 
we would like to minimize this effect (see Fig. ??). We chose 
Intel because it has been studied by us in Ref. 1 2] (chapter lTVl 
and it can be modelled well with the Heston model introduced 
in chapter lTVl It is true that it is a highly traded stock, and that 
is an advantage, since that are a lot of trades in a day and there- 
fore the statistics is better. Therefore smaller stocks should be 
also checked in the future. The year of 1997 represents most 
of what one finds for other years, except perhaps 2000 and 
2001 which we did not verified because of technical problems 
(to large data set requires especial computing techniques that 
should be implemented in the future). 

We begin by showing the influence of the discrete nature of 
the absolute price change in the intraday log-return data. This 
is rarely pointed out, even th ough there is a vast literature on 
intraday log -returns fl5l Ir30l l6lt Ir38l Isvtl . This discreteness 
has to be accounted for when considering subordination, or 
even when studying intraday returns. It implies that a contin- 
uous probability density is only a convenient approximation 
for some return horizons. 

In section IV Bl we verify when and for what range of 
data does subordination apply. We assume that the integrated 
volatility Vt is the random subordinator of a driftless Brown- 
ian motion and that Vt is proportional to the number of trades 
Nt in an interval of time t. We also use tick-by-tick data to 
check for subordination by constructing the probability den- 
sity of the log-returns xn after N trades 06l >. 

In section IV CI we model the integrated variance Vt with 
the CIR process introduced in Eq. i38l . We present the level 
of agreement between the data and the theoretical CIR model 
and we link these results to the distribution of log-returns x t . 

In the last section, we present a summary of our findings. 



A. Discrete nature of stock returns 

On a tick-by-tick level, price changes are discrete. There is 
a minimal price change for bid and offers that is set by internal 
rules of the stock exchange. In the case of Intel in the year of 
1997, the minimal price change was $1/8 for the first part of 
the year and after June, 24th it became $1/16 |88, 89[. Never- 
theless, empirically we find that the smallest price change on 
realized transactions is h = $1/64 (Fig. [9}. This difference 
is a direct consequence of the mechanism of trading, and we 
will not study it here (see Ref. M In fill . We note that 
the minimal price change set by law is clear in Fig. |9j since 
the most probable price changes are indeed 0, ±4/i = $1/16 
and ±8h = $1/8, according to the rules of the NASDAQ ex- 
change in 1997. 

Our goal in this section is to identify the discrete nature 
of absolute price changes ll 26tl after N trades {m^h — 
S n — S n -N) in the log-returns after N trades (xjv = 
ln{S n ) — ln(S n -N)) and in log-returns after a time-lag t 
(x t — Iti(St) — In(ST-t)), since these log-returns are the 




-48 -40 -32 -24 -16 -8-4 4 8 16 24 32 40 48 
m„=(S-S„ „)/h,h = $1/64 



FIG. 9: Dimensionless absolute returns mjv = (S n — S n -N)/h 
for N trades in log linear and linear scale (center and bottom panels 
respectively). In the top panel we show the difference of the PDFs 
for tun and mjv_i to illustrate the oscillatory nature of the discrete 
PDF for absolute returns: it evolves from a "pulse" like shape for 
N = 1 to a "constant wave" for N = 4000. 



quantities that we ultimately want to model. We want to point 
out that the discrete nature of the log-returns for intraday work 
is generally overlooked but it can influence in the analysis of 
short returns. 

We will refer to minimal price change h — $ 1 /64 as "quan- 
tum of price" or simply "quantum" in analogy with quantum 
mechanics. 

The discrete nature of the price change can be used to model 
the price dynamics starting from a microscopic approach as 
recently suggested in l55ll57ll92T 19311 . We are interested in the 
limit where the quantum effect is not noticeable and therefore 
quantities such as number of trades and returns can be treated 
as continuous random variables. 

Fig. [9] shows the probability density for the dimensionless 
absolute price return = (S n — S n -N)/h after N trades 
in steps of one quantum h. The nature of the tick-by-tick dis- 
tribution (N = 1) is considerably different from N = 4000. 
More than 50% of the returns are zero for N = 1, and most 
of the other returns have a probability of less than 1% except 
±4/i and ±8/1. The probability has a clearly oscillatory nature 
where multiples of Ah are maxima (Fig. |9j top panel). After 
4000 trades the probability distribution for mjv has changed 
into a two level system (Fig. |9jl- The probability of the most 
probable m^r in N = 1 have now approximately the same 
probability. Therefore, the zero return has (after 4000 trades) 
a comparable probability to the other probability maxima. 

The quantum nature of the price changes is removed by 
working with log-returns, except for the zero return. Notice 
that intraday log-returns can be approximated by the ratio l94ll 
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FIG. 10: Effect of taking log-returns instead of taking absolute re- 
turns. Lower panel shows the probability density of the dimension- 
less log-returns xn/Ii conditioned on mjv, P(xn //i|mjv). The 
values concentrated about a multiple of h (upper panel), spread 
about their respective h value. The vertical color coded lines (lower 
panel) indicate the h value from which each, equally color coded, 
P(xn /h\rriN) originated. The discreteness of tum is removed by 
taking log-returns since the spread of P(a;jv//i|mjv) is larger than 
h. 



mi,Nh = SiN — S(i-i-)N, i = 1, 2, 3 
m N 



Xi,N 



C=S /h,i = 1,2, 3,(43) 



where So is the first open of the year (in the case of Intel 1997, 
So = $131.75). 

The effect of taking log-returns is illustrated in Fig. ^3 
For each absolute return tojv, there is a potentially different 
denominator S^-jv / h (I42> composed by a random walk with 
integer valued steps about a level C ( I43> . Clearly the values 
of the ratio x n will not be integer. Therefore, the ratio of tojv 
in Eq. (I43> spreads the concentrated discrete absolute returns 
multiple of h, around the multiple. 

The lower panel of Fig. [K)]shows the probability density of 
xn /h conditioned on mjv. The conditional probability den- 
sity P(xn /h\mjsi) illustrates a spread for each tun that is 
larger than h. This spread is enough to mix the discreteness 
with exception of tun = 0. 

The quality of such a mixture can be seen in Fig. II Hand 
Fig. [21 Even though the cumulative density function for xm 
is practically continuous (even for N = 1) with exception of 
xn = 0, the stepwise nature of mjy can be easily recognized 
up to N = 1000 (Fig. II 2t . The oscillations in the cumulative 
density functions for x n are centered about the discrete steps 
of the cumulative density function of tojv. 

The discrete quantum effect at mjv = is quite persistent, 
but it can be neglected for returns x n with large number of 
trades N (for instance N = 4000). Empirically, it appears 
that the criteria for neglecting the mjv = effect is that the 
probability of having mjy = is of the same order of mag- 
nitude as the probability of having any other ttim (Fig|9j. For 
Intel 1997 this transition starts approximately at N — 1000. 

The effect of data discreteness is also present in the log- 
return x t of time lag t. From the log -return Xt, we can con- 
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FIG. 11: Cumulative probability density for both dimensionless log- 
retums, xn /h (black line), and dimensionless absolute returns, mjv 
(blue symbols). Even though the discreteness of mjv is removed 
with exception of xn = 0, the signature of such discreteness is still 
visible. Notice the stepwise nature of the black line. 
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FIG. 12: Cumulative probability density for both dimensionless log- 
retums, XN/h, and dimensionless absolute returns, mjv. When TV 
increases the CDF becomes progressively less oscillatory and the dis- 
crete nature of the underlying absolute returns becomes less clear. 
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FIG. 13: Cumulative probability density for xt/h with t = 5 min- 
utes. The discreteness at zero persists from xn/Ii as well as the 
oscillation (stepwise nature) of the CDF. 

struct a; at by conditioning on the number of trades N present 
in t (Nt). The opposite is also true, by conditioning on t we 
can construct Xt from xn- Therefore some of the discrete 
effects that are present in xn will be present in x t . As an ex- 
ample consider 5 minute log-returns. The average number of 
trades is (N t= 5 m in) = 200 ± 184. Because of the reciprocity 
in constructing the PDF for x t from x n (and vice-versa) by 
conditioning, this shows that in the composition of Xt=^ m in, 
there is a wide range of xn for which the discrete features 
can not be ignored (clear oscillations and large probability for 
Xt — 0). If we approximate the PDF of N t= 5 m in by a Gaus- 
sian distribution, we would have in Xt=5min, with the high- 
est probability, N t= 5 m in — 200. Therefore some fraction of 
xn = 200 will be sampled when we construct the probability 
of Xt=5min by conditioning, these returns clearly have a lot 
of discrete features (Fig. I12l i and these features will pass to 

^t—5mi n ■ 

Fig. ^] shows the oscillatory stepwise cumulative proba- 
bility density and also the special nature of Xt—^min = for 
the cumulative probability density of Xt=5min- Compare this 
figure with Fig. ^2 an d Fig- U3 These features originate from 
xn and represent small flat portions in the probability density 
function. 

Finally, from the sequence of Figs. [^and[2]and the cor- 
respondence between xn and xt, we can conclude that the 
discrete effects become negligible for a time lag t > 1 hour. 

B. Verifying subordination with intraday data 

The hypothesis of subordination introduced by Clark j2r3l 
has had a strong economical implication, and following his 
work there is a vast body of theoretical and empirical work 
which addresses the issue 1 38, 39, 40, 4 ll 14211 . Similar to the 
work of Refs. 1 40, 41], we verify for subordination consid- 
ering integrated variance Vt, constructed from the number of 
trades Nt, to be the subordinator of a Brownian motion. 

Due to the discrete nature of the distribution of intraday re- 
turns presented in section iV Al . we can only talk about sub- 
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FIG. 14: Variance of the log-return x N for N = 1 to TV = 10000. 
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FIG. 15: Variance of the demean log-return xt for intraday time lags 
t. 
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FIG. 16: Average number of trades in an intraday interval t. 

ordination as formulated in equation d36i after the discrete ef- 
fects become small. In what follows, we will take all time lags 
even those where the discrete effects are large. Nevertheless, 
we will see that the best subordination will take place for time 
lags for which discrete effects can be ignored. 

The first implication of subordination can be verified with 
the use of moments given by equations d33l and j35l >. Figs. 
E]and^]show the linear time relation for both the variance 
of xt and the mean of Nt as expected from equation J33l >. 
Furthermore, since we are assuming a Brownian motion with 
stochastic variance given by the number of trades, log-returns 
xn after N trades should be Gaussian distributed with vari- 
ance (x N ) = cr N N. Fig. U4lshows the linear relation of (x 2 N ) 
vs. N. The implied consistency between the slope values in 
Figs. 1141 [T5l and[T6lrequired by subordination is 

(x 2 t ) =6t = a N {N t ) =a 2 N r]t = a 2 N V- (44) 

Using expression 144\ . the difference between 9 measured 
(Figll5> and 9 = t](t n from Fig. 1 141 and Fig. [H[]is less than 
1%. 

In order to find a time and a return range where sub- 
ordination takes place, we look at the data in 3 different 
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FIG. 17: Cumulative probability density for the demean and stan- 
dard deviation (STD) normalized xn log-returns (color coded solid 
curves), compared to the Gaussian distribution of mean zero and 
STD one (dashed curve). From small N to large N, there is a pro- 
gressive agreement with the Gaussian with best agreement between 
N = 3500 and N = 4500. While smaller values of N have CDFs 
above the Gaussian, larger values are below the Gaussian. 
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FIG. 18: Skewness and excess kurtosis (labelled as "kurtosis" in the 
figure) as a function of TV for the normalized log-returns xn in Fig. 
1171 For a Gaussian distribution the skewness is zero and the excess 
kurtosis is also zero. As the number of trades (ticks) N increase the 
skewness and excess kurtosis become zero. The probability density 
for x at can be well approximated by a Gaussian for TV > 2500, since 
both skewness and excess kurtosis are small. 



ways. First, using tick-by-tick data, we construct the distri- 
bution of the log-return x^ after N trades, xn should be 
Normal distributed with mean zero and standard deviation 
ctatv^V- We also present the N dependence of the skewness 

} 2 )-3)of 



i{x%}/{(x 2 N ) 3 / 2 )) and excess kurtosis ((x%) / ((x 
x N in Fig.lTtfl 

Second, using t minute returns x t and the number of trades 
N t in the same t interval, we construct the time series 



e t = V t = a%N u 

V Vt 



(45) 



where V t is the integrated variance in an interval t and 
is the proportionality constant that converts number of trades 




FIG. 19: Cumulative density function (CDF) for e t as defined in 
equation I45i for three different t compared to the Gaussian (solid 
line). The parameters crjv in J45I is chosen for the best agreement 
between the Gaussian and the data. 




FIG. 20: Cumulative density function (CDF) for e t as defined in 
equation I45i for three different t compared to the Gaussian (solid 
line). Contrary to Fig. 1191 the parameter crjv in equation J45I is 
found using Fig. 1141 Notice that the Gaussian lies above the data in 
the tails. 



Nt into variance. If indeed subordination holds, e t is Normal 
distributed with mean zero and standard deviation one, due to 
the central limit theorem 1 27.1 14 ill . 

Finally, we check subordination by numerically calculating 
the probability mixture equation d36i . We construct the prob- 
ability density function of the number of trades Nt inside a 
time interval t by binning the time series of Nt. The choice 
forbinwidth is according to Ref. l95ll . However, the result ap- 
pears independent of binwidth as long as the bin width chosen 
is not too large. The cumulative probability density function 
for the measured x t and the non-parametric reconstructed x' t 
are shown in Fig. 

The distributions in Fig. Fig. ^] and Fig. 12 If solid 
line) show an agreement of approximately 85% of the data 
with the subordination hypothesis for time lags above t > 1 
hour or N > 2500 (Fig. II 8t i. However, the subordination is 
clearly bad for times close to one day (t — 6.5 hours), where 
we do not have enough data (253 points) to draw meaningful 
conclusions. 

Notice the clear disagreement above 2 standard deviations 
(STD) as well as at zero in Fig. [n]and Fig. ^] The deviations 
at zero are due to the discrete nature of the data (section lV At 
while the deviations above 2 STD show that the subordination 
hypothesis can not explain the large changes in returns 1 42] . 
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FIG. 21: Cumulative distribution of the stock returns xt compared 
to the reconstructed cumulative distribution function (black lines) by 
randomizing the variance Vt of a Gaussian distribution. The proba- 
bility of Vt is constructed by binning the number of trades, and this 
probability is used non-parametrically in the integral 1361 . The solid 
lines have parameter <tjv chosen in order to minimize the least square 
error between the empirical xt distribution and reconstructed vari- 
ance changed Brownian motion I36i . The dashed line has on found 
from Fig. 1141 



ForFig.[l9]and Fig.EjTsolid line), a 2 N = 2 x 1CT 8 is found 
to give the best agreement between the measured data and the 
reconstructed data. For Fig. Fig. [20]and Fig. l21T dashed 
line), a 2 N = 2.39 x 1CT 8 is found from Fig. [TJ] Notice that 
the higher in Fig. |20|and Fig. 12 II (dashed lines) seems to 
indicate an overestimation of cat, since the curves constructed 
by subordination are generally above the data. 

The lower value of a% for Fig. ^]and Fig. [^(solid line) 
leads to a violation of relation J44i . The difference between 
measured 8 in Fig. ^]and the one calculated from rjajj is 
now of approximately 16%. In order to verify the origin of 
such difference, we remove 8% of the largest log-return x t 
data on both tails (ignore 8% of the largest xt on the positive 
and negative tail for all time lags t used), a total of 16% of 
the data. We find now a 8 w 8.01 x 10~ 7 . This new 8 does 
not violate relation J44l > with a 2 N = 2 x 10~ 8 and reconfirms 
that subordination with Vt = <J%Nt is unable to explain large 
changes (> 85%) in the log-returns x t . This reconfirmation 
arises because we had to ignore 16% of the data in the tails to 
reduce 8. Dropping 16% of the tails is equivalent to looking 
only at the center w 85% of the data and saying that subordi- 
nation is only valid of it. 



C. Models for the subordinator 

Having verified that a Brownian motion subordinated to the 
number of trades N t via Vt can describe approximately 85% 
of the return data for time lags larger than 1 hour (or, if one ig- 
nores discreetness effects such as the zero return effect, larger 
than 30 minutes), we can model Vt instead of modelling xt- 

In this section, we verify the quality of modelling Vt with a 
CIR process as given in section JIl Bl >. We present the quality 



of the CIR fit for Intel in the year 1997. We also show that the 
quality of the Heston fit to x t with parameters from the Vt CIR 
fit is consistent with the quality of the subordination: we are 
able to model most of the central 85% of the xt distribution. 

Due to previous studies with intraday log-returns |2|] (see 
also chapter HVt . we assume a = 1 for the simplified CIR 
model in equation J38i . The parameter 8 is found from the re- 
lation 8 = rj<j\j J44> . The remaining parameter 7 is found by 
fitting the empirical PDF(Vt) for time lags t = 1: 05 hours 
and t = 2: 10 hours simultaneously. The regular quality of 
such a fit is shown in Figs. [22] and [23] The theoretical CIR 
lines are above the data (Fig. I23> . Furthermore, the time de- 
pendence of the theoretical PDF and CDF only approximately 
follow the data. For times below 1 hour the probability maxi- 
mum of the empirical distribution is to the left of the theoreti- 
cal distribution and for times above 1 hour to the right. 

The results shown in Figs. I22land l23l indicate that the CIR 
is only approximately valid. The quality can be further as- 
sessed by constructing the variance of the Vt as a function of 
the time lag t. Fig. [24] shows that the theoretical variance 
given in equation J41t is only approximately correct. Never- 
theless from equation i33\ , we know that the variance of Vt 
corresponds to the kurtosis of xt- This indicates that even 
though Vt can not be modelled well (not even the second mo- 
ment) the implication of that is only important to the fourth 
and higher moments in the log -returns x t . 

To verify the quality of the parameters found by fitting the 
subordinator, Vt, in explaining the log-returns, Xt, we present 
Figs. [25]and[2i The empirical PDF ([25} and CDF @D for 
x t show that the corresponding Heston model (dashed black 
lines), constructed with parameters found by fitting CIR to the 
probability density of Vt, is able to fit only the center of the 
empirical distributions of xt (~ 80% — 85%) at t — 65, 130 
minutes (Fig. 1261. 

To recheck the consistency of the subordination approach, 
we fit the empirical PDF of x t directly with the Heston model 
j20t . We proceed in similar fashion to the fitting procedure in 
chapterllVl We assume a = 1 and take 8 = 8.01 x 10~ 1 . The 




2 3 

V , integrated variance 



FIG. 22: Empirical probability density function for the number of 
trades (ticks) Nt or integrated variance Vt = a%Nt, compared to 
the least square fit with the CIR formula 1381 . Curves are offset by a 
factors of 10. 
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parameter 8 was found from the relation 8 = a 2 N rj (I44i . where 
i] is found from Fig. ^]and a 2 N is given such that the subor- 
dination in Figs. I19landl21lis the best possible. Finally, we fit 
the empirical PDFs (Fig. 125 \ for the parameter 7. Therefore, 
we are effectively only fitting 7, since all the other parameters 
are the same used in the Vt fit (Fig. I22> . We find that the 7 
found from fitting the empirical PDF of x t directly, is of the 
same order of magnitude as with the one found by fitting the 
empirical PDF of Vt (0.05 from x t and 0.06 from Vt). This 
shows, that the subordination indeed captures most of the in- 
formation for the center of the distribution, since fitting V t or 
Xt for 7 is equivalent. 

Notice that the agreement of the theoretical Heston model 
curves, constructed with parameters from the Vt fit, is practi- 
cally identical to the agreement found in Fig. 12 If solid lines) 
between the CDF of x t and the CDF constructed by subor- 
dination using the non-parametric binned probability density 
of Vt as the variance of a Gaussian random walk \36\ . The 
information content in the number of trades and therefore in 
the integrated variance distribution is almost all captured by 
CIR, even with a regular fit quality (Fig. I23> . This last point 
implies that even if we had a better fit to the distribution of Vt, 
the increase in the fitting quality of the log-returns will not be 
substantial. 

A substantial increase in the fitting quality of the empirical 
PDF and CDF of the log-returns in Figs. [25]and[26]is attained 
if one fits the empirical PDF of x t directly with 8 = 9.53 x 
10 -7 given in Fig. This amounts to take <r^ as given by 
Fig. ^]and 7] by Fig. ^] such that relation J44b is still valid. 
The parameter 7 = 0.02 for the black solid lines in Fig. [26] 
is also considerably different from 7 = 0.06, found by fitting 
the empirical PDF of Vt and using 8 = 8.01 x 10~ 7 such that 
u 2 N is the best fit value for the subordination in Figs. 12 If solid 
line) and^] The substantial increase in the fitting quality for 
Xt, reemphasizes that the number of trades are only able to 
describe the center of the distribution of log-returns (section 
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FIG. 23: Cumulative distribution function (CDF) for the number of 
trades Nt and integrated variance Vt compared to the CIR fit (solid 
lines). The CDF(Vt) goes from to 0.5. 1 - CDF{Vt) goes from 
0.5 to 0. The lower tail (Vt : 0- > 0.5) of the CDF is to the left 
and the upper tail (Vt : 0.5— > 0) to the right of 0.5 for each time t 
curve. 
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FIG. 24: Variance of the integrated variance (V t 2 -{V t } 2 } for differ- 
ent time lags t for the data (circles) compared to the theoretical CIR 
variance given in equation 1411 (solid black line). For comparison the 
best power-law fit (V 2 — (Vt) 2 ) oc t 1 ' 77 is shown (solid red line). 
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FIG. 25: Probability distribution function for the log-returns xt com- 
pared to the Heston model (dashed and solid lines). The two lines 
represent a different set of parameters. The solid line has parameters 
8 from Fig. ll4l and 7 is found directly by fitting x t . The dashed lines 
has 9 — G%r\ with a 2 N from Fig. l21f solid lines) and Fig. 1 191 and r\ 
from Fig- 1161 The parameter 7 is then found by fitting the probability 
density of Vt- Curves are offset by factors of 10. 



ED. 



D. Conclusion 

We have studied the discrete nature of the probability dis- 
tribution of absolute returns that arises from the minimal dis- 
crete price change for bid and offers allowed by the stock ex- 
change. We have shown that such discrete nature implies that 
the probability distributions of log-returns for intraday time 
lags are only approximately continuous. The continuous ap- 
proximation becomes good for returns with time lags longer 
than 1 hour. 

We have shown that, using the integrated volatility Vt = 
a^Nt derived from the number of trades Nt as the subordi- 
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FIG. 26: Cumulative probability density of xt compared to the He- 
ston model. Theoretical lines (dashed and solid) are constructed by 
integrating the theoretical probability density functions shown in Fig. 
1251 The two theoretical lines represent a different set of parameter. 
The solid line has parameters 6 from Fig. ^|and 7 is found directly 
by fitting x t . The dashed lines has 6 — a\,r\ with a% from Fig. 
l21l solid lines) and Fig. 1 191 and 77 from Fig. 1161 The parameter 7 is 
then found by fitting the probability density of Vt. Notice that the 
solid black line clearly gives a better fit to the data. 

nator of a driftless Brownian motion (1361 . we are able to de- 
scribe the center (w 85%) of the distribution of log-returns x t 
for time lags t > 1 hour and smaller than t < 1 day. The 
upper limit is restricted by the number of data points we have, 
since we are working with only one year of data. 

We also have shown that the CIR process is only able to 
approximately describe the distribution function for V t . How- 
ever, this approximate description is already enough for the 
corresponding Heston model to fit the log-returns Xt with ap- 
proximately the maximum quality that the subordination al- 
lows (» 80% - 85%). 

Finally, a direct fit to the log-returns x t with the Heston 
model results in a considerable increase in the fitting quality. 
This reemphasizes that the process of subordination, as im- 
plied by the empirical probability density of Vt, is only able 
to explain the center of the distribution of returns. 



VI. INCOME DISTRIBUTION 

Attempts to apply the methods of exact sciences, such as 
physics, to describe a society have a long history 1 96] . At the 
end of the 19th century, Italian physicist, engineer, economist, 
and sociologist Vilfredo Pareto suggested that income distri- 
bution in a society is described by a power law |97]. Modern 
data indeed confirm that the u pper tail of inco me distribution 
follows the Pareto law However, the 

majority of the population does not belong there, so charac- 
terization and understanding of their income distributi on re- 
mains an open problem. Dragulescu and Yakovenko fiol 
proposed that the equilibrium distribution should follow an 
exponential law analogous to the Boltzmann-Gibbs distribu- 
tion of energy in statistical physics. The first factual evidence 



for th e exponential distribution of income was found in Ref. 
1 104]. Coexistence of the exponential and power-law parts of 
the distribution was recognized in Ref. 1 105]. However, these 
papers, as well as Ref. 110611 . studied the data only for a par- 
ticular year. Here we analyze temporal evolution of the per- 
sonal income distribution in the USA during 1983-2001. We 
show that the US society has a well-defined two-income-class 
structure. The majority of population (97-99%) belongs to the 
lower income class and has a very stable in time exponential 
("thermal") distribution of income. The upper income class 
(1-3% of population) has a power-law ("superthermal") dis- 
tribution, whose parameters significantly change in time with 
the rise and fall of the stock market. Using the principle of 
maximal entropy, we discuss the concept of equilibrium in- 
equality in a society and quantitatively show that it applies to 
the bulk of the population. 



A. Data analysis and discussion 

Most of academic and government literature on income dis- 
tribution and inequality 1 1 07l 1 1 Qgl 1 1 09t IllOll does not attempt 
to fit the data by a simple formula. When fits are performed, 
usually the log-normal distributio n 111 ill is used for the lower 
part of the distribution 1 100, 1 lL 1 1 0211 . Only recently the ex- 
ponential distribution started to be recognized in income stud- 
ies fulfill , and models showing formation of two classes 
started to appear Ill4llll5ll . 

Let us introduce the probability density P(r), which gives 
the probability P(r) dr to have income in the interval (r, r + 
dr). The cumulative probability C(r) = f dr'P(r') is 
the probability to have income above r, C(0) = 1. By 
analogy wit h the Boltzmann-Gibbs distribution in statisti- 
cal physics 1 103l 110411 . we consider an exponential function 
P(r) oc exp(— r/T), where T is a parameter analogous to 
temperature. It is equal to the average income T = (r) = 
J °° dr'r 1 P{r r ), and we call it the "income temperature." 
When P(r) is exponential, C(r) oc exp(— r/T) is also expo- 
nential. Similarly, for the Pareto power law P(r) oc l/r Q+1 , 
C(r) oc l/r a is also a power law. 

We analyze the data ll 16ll on personal income distribution 
compiled by the Internal Revenue Service (IRS) from the tax 
returns in the USA for the period 1983-2001 (presently the 
latest available year). The publicly available data are already 
preprocessed by the IRS into bins and effectively give the cu- 
mulative distribution function C(r) for certain values of r. 
First we make the plots of log C (r) vs. r (the log-linear plots) 
for each year. We find that the plots are straight lines for the 
lower 97-98% of population, thus confirming the exponen- 
tial law. From the slopes of these straight lines, we determine 
the income temperatures T for each year. In Fig.^J we plot 
C(r) and P(r) vs. r/T (income normalized to temperature) 
in the log-linear scale. In these coordinates, the data sets for 
different years collapse onto a single straight line. (In Fig. 
l27l the data lines for 1980s and 1990s are shown separately 
and offset vertically.) The columns of numbers in Fig. I27llist 
the values of the annual income temperature T for the corre- 
sponding years, which changes from 19 k$ in 1983 to 40 k$ 
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80.34 120.51 160.68 200-85 



1990, 27.06 k$ 

1991, 27.70 k$ 

1992, 28.63 kS 




Rescaled adjusted gross income 



FIG. 27: Cumulative probability C'(r) and probability density P(r) 
plotted in the log-linear scale vs. r/T, the annual personal income 
r normalized by the average income T in the exponential part of the 
distribution. The IRS data points are for 1 983-200 1 , and the columns 
of numbers give the values of T for the corresponding years. 
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Adjusted gross income in 2001 dollars, k$ 




1983. 19.35 k$ 

1984. 20.27 k$ 

1985. 21.15 k$ 

1986. 22.28 k$ 

1987. 24.13 k$ 

1988. 25.35 k$ 
I, 26.38 k$ 



Rescaled adjusted gross income 



FIG. 28: Log-log plots of the cumulative probability C'(r) vs. r/T 
for a wider range of income r. 



in 2001. The upper horizontal axis in Fig.l27lshows income r 
in k$ for 2001. 

In Fig.|28] we show the same data in the log-log scale for 
a wider range of income r, up to about 300T. Again we ob- 
serve that the sets of points for different years collapse onto 
a single exponential curve for the lower part of the distri- 
bution, when plotted vs. r/T. However, above a certain in- 
come r* ~ 4T, the distribution function changes to a power 
law, as illustrated by the straight lines in the log-log scale 
of Fig. [28] Thus we observe that income distribution in the 
USA has a well-defined two-class structure. The lower class 
(the great majority of population) is characterized by the ex- 
ponential, Boltzmann-Gibbs distribution, whereas the upper 
class (the top few percent of population) has the power-law, 
Pareto distribution. The intersection point of the exponential 
and power-law curves determines the income r* separating 
the two classes. The collapse of data points for different years 
in the lower, exponential part of the distribution in Figs. 
and [28] shows that this part is very stable in time and, essen- 
tially, does not change at all for the last 20 years, save for 
a gradual increase of temperature T in nominal dollars. We 
conclude that the majority of population is in statistical equi- 
librium, analogous to the thermal equilibrium in physics. On 
the other hand, the points in the upper, power-law part of the 
distribution in Fig.|28]do not collapse onto a single line. This 
part significantly changes from year to year, so it is out of 



statistical equilibrium. A similar two-part structure in the en- 
ergy distribution is often observed in physics, where the lower 
part of the distribution is called "thermal" and the upper part 
"superthermal" Ill7ll . 

Temporal evolution of the parameters T and r* is shown in 
Fig. [29] We observe that the average income T (in nominal 
dollars) was increasing gradually, almost linearly in time, and 
doubled in the last twenty years. In Fig. |29] we also show the 
inflat ion coefficient (the consumer price index CPI from Ref. 
ill 18ll l compounded on the average income of 1983. For the 
twenty years, the inflation factor is about 1.7, thus most, if 
not all, of the nominal increase in T is inflation. Also shown 
in Fig.[29]is the nominal gross domestic product (GDP) per 
capita 1 1 18], which increases in time similarly to T and CPI. 
The ratio r*/T varies between 4.8 and 3.2 in Fig. 1291 

In Fig. [30] we show how the parameters of the Pareto tail 
C(r) oc l/r a change in time. Curve (a) shows that the power- 
law index a varies between 1.8 and 1.4, so the power law is 
not universal. Because a power law decays with r more slowly 
than an exponential function, the upper tail contains more in- 
come than we would expect for a thermal distribution, hence 
we call the tail "superthermal" ll 17ll . The total excessive in- 
come in the upper tail can be determined in two ways: as the 
integral J r °° dr'r' 'P(r') of the power-law distribution, or as 
the difference between the total income in the system and the 
income in the exponential part. Curves (c) and (b) in Fig. 1301 
show the excessive income in the upper tail, as a fraction / 
of the total income in the system, determined by these two 
methods, which agree with each other reasonably well. We 
observe that / increased by the factor of 5 between 1983 and 
2000, from 4% to 20%, but decreased in 2001 after the crash 
of the US stock market. For comparison, curve (e) in Fig. 
1301 shows the stock market index S&P 500 divided by infla- 
tion. It also increased by the factor of 5.5 between 1983 and 
1999, and then dropped after the stock market crash. We con- 
clude that the swelling and shrinking of the upper income tail 
is correlated with the rise and fall of the stock market. Similar 
results were found for the upper income tail in Japan in Ref. 
ll99ll . Curve (d) in Fig.BOIshows the fraction of population in 
the upper tail. It increased from 1% in 1983 to 3% in 1999, but 
then decreased after the stock market crash. Notice, however, 
that the stock market dynamics had a much weaker effect on 
the average income T of the lower, "thermal" part of income 
distribution shown in Fig. 1291 

For discussion of income inequality, the standard practice 
is to construct the so-called Lorenz curve I107I1 . It is defined 
parametrically in terms of the two coordinates x(r) and y(r) 
depending on the parameter r, which changes from to oo. 
The horizontal coordinate x{r) = JT dr' P{r') is the fraction 
of population with income below r. The vertical coordinate 
y(r) = J Q dr'r'P(r')/ J dr'r'P(r') is the total income of 
this population, as a fraction of the total income in the sys- 
tem. Fig. [^ shows the data points for t he Lo renz curves in 
1983 and 2000, as computed by the IRS itTToll . For a purely 
exponential distribution of income P(r) oc exp(— r/T), the 
formula y = x + (1 — x) ln(l — x) for the Lorenz curve was 
derived in Ref. [104]. This formula describes income distri- 
bution reasonably well in the first approximation [104], but 
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and power-law ( r,) 



(b) r./T 




(d) Inflation (CPI) 
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US, IRS data for 1983 and 2000 



FIG. 29: Temporal evolution of various parameters characterizing 
income distribution. 
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FIG. 30: (a) The Pareto index a of the power-law tail C(r ) oc l/r a . 
(b) The excessive income in the Pareto tail, as a fraction / of the total 
income in the system, obtained as the difference between the total 
income and the income in the exponential part of the distribution, (c) 
The tail income fraction /, obtained by integrating the Pareto power 
law of the tail, (d) The fraction of population belonging to the Pareto 
tail, (e) The stock-market index S&P 500 divided by the inflation 
coefficient and normalized to 1 in 1983. 



visible deviations exist. These deviations can be corrected 
by taking into account that the total income in the system is 
higher than the income in the exponential part, because of the 
extra income in the Pareto tail. Correcting for this diffe rence 
in the normalization of y, we find a modified expression 
for the Lorenz curve 

2/=(l-/)[x+(l-x)ln(l-x)] + /e(x-l), (46) 

where / is the fraction of the total income contained in the 
Pareto tail, and (x — 1) is the step function equal to for 
x < 1 and 1 for x > 1. The Lorenz curve J46> experiences a 
vertical jump of the height / at x = 1, which reflects the fact 
that, although the fraction of population in the Pareto tail is 
very small, their fraction / of the total income is significant. 
It does not matter for Eq. J46i whether the extra income in the 
upper tail is described by a power law or another slowly de- 
creasing function P(r). The Lorenz curves, calculated using 
Eq. (06} with the values of / from Fig. |30| fit the IRS data 




Cumulative percent of tax returns 



FIG. 31: Main panel: Lorenz plots for income distribution in 1983 
and 2000. The data points are from the IRS 1 1 10], and the theoretical 
curves represent Eq. |46l with / from Fig. 1301 Inset: The closed 
circles are the IRS data 1 1 10] for the Gini coefficient G, and the open 
circles show the theoretical formula G — (1 + /) /2. 



points very well in Fig. 1311 

The deviation of the Lorenz curve from the diagonal in Fig. 
13 II is a certain measure of income inequality. Indeed, if ev- 
erybody had the same income, the Lorenz curve would be the 
diagonal, because the fraction of income would be propor- 
tional to the fraction of population. The standard measure of 
income inequality is the so-called Gini coefficient < G < 1, 
which is defined as the area between the Lorenz curve and the 
diagonal, divided by the area of the triangle beneath the di- 
agonal fl07tl . It was calculated in Ref. Jiol that G = 1/2 
for a purely exponential distribution. Temporal evolution of 
the Gini coefficient, as determined by the IRS ill 1(1 . is shown 
in the inset of Fig.|^ In the first approximation, G is quite 
close to the theoretically calculated value 1/2. The agreement 
can be improved by taking into account the Pareto tail, which 
gives G = (1 + /)/2 for Eq. g6}. The inset in Fig.l3T1shows 
that this formula very well fits the IRS data for the 1990s 
with the values of / taken from Fig.[3U] We observe that in- 
come inequality was increasing for the last 20 years, because 
of swelling of the Pareto tail, but started to decrease in 2001 
after the stock market crash. The deviation of G below 1/2 in 
the 1980s cannot be captured by our formula. The data points 
for the Lorenz curve in 1983 lie slightly above the theoretical 
curve in Fig.|^ which accounts for G < 1/2. 

Thus far we discussed the distribution of individual income. 
An interesting related question is the distribution of family 
income Pa^)- If both spouses are earners, and their in come s 
are distributed exponentially as P\(r) oc cxp(— r/T)\ 127], 
then 



P2(r) 



dr'P 1 (r')P 1 (r - r') oc r exp(-r/T). (47) 



Eq. t47l is in a good agreement with the f amily income dis- 
tribution data from the US Census Bureau Il04ll . In Eq. J47b . 
we assumed that incomes of spouses are uncorrelated. This 
assumption was verified by comparison with the data in Ref. 
1 106]. The Gini coefficient for family income distribution 
g3 was found to be G = 3/8 = 37.5% [104], in agree- 



21 



ment with the data. Moreover, the calculated value 37.5% is 
close to the average G for the developed capitalist countries 
of North America and Western Europe, as determined by the 
World Bank floe!. 

On the basis of the analysis presented above, we propose 
a concept of the equilibrium inequality in a society, charac- 
terized by G = 1/2 for individual income and G = 3/8 
for family income. It is a consequence of the exponential 
Boltzmann-Gibbs distribution in thermal equilibrium, which 
maximizes the entropy S = J dr P{r) In P(r) of a dis- 
tribution P(r) under the constraint of the conservation law 
(r) = J °° dr P(r) r = const. Thus, any deviation of income 
distribution from the exponential one, to either less inequality 
or more inequality, reduces entropy and is not favorable by 
the second law of thermodynamics. Such deviations may be 
possible only due to non-equilibrium effects. The presented 
data show that the great majority of the US population is in 
thermal equilibrium. 

Finally, we briefly discuss how the two-class structure of 
income distribution can be rationalized on the basis of a ki- 
netic approach, which deals with temporal evolution of the 
probability distribution P(r,t). Let us consider a diffusion 
model, where income r changes by Ar over a period of time 
At. Then, temporal evolution of P(r, t) is described by the 
Fokker- Planck equation 1 1 19] 



dP 
~dt 



d_ 
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AP 



d_ 

dr 



(BP) 



A 



(Ar) 
At 



B 



((Ar) 



2At 
(48) 

For the lower part of the distribution, it is reasonable to as- 
sume that Ar is independent of r. In this case, the coeffi- 
cients A and B are constants. Then, the stationary sol ution 
df P = of Eq. (I48> gives the exponential distribution Il03ll 
P(r) oc exp(-r/T) with T = B/A. Notice that a mean- 
ingful solution requires that A > 0, i.e. (Ar) < in Eq. 
( I48> . On the other hand, for the upper tail of income distri- 
bution, it is reasonable to expect that Ar oc r (the Gibrat law 
111 1 lln . so A — ar and B = br 2 . Then, the stationary so- 
lution df P = of Eq. d48t gives the power-law distribution 
P(r) oc l/r Q+1 with a = 1 + a/b. The former process is 
additive diffusion, where income changes by certain amounts, 
whereas the latter process is multiplicative diffusion, where 
income changes by certain percentages. The lower class in- 
come comes from wages and salaries, so the additive process 
is appropriate, whereas the upper class income comes from in- 
vestments, capital gains, etc., where the multiplicative process 



is applicable. Ref. l9lil quantitatively studied income kinetics 
using tax data for the upper class in Japan and found that it is 
indeed governed by a multiplicative process. The data on in- 
come mobility in the USA are not readily available publicly, 
but are accessible to the Statistics of Income Research Divi- 
sion of the IRS. Such data would allow to verify the conjec- 
tures about income kinetics. 



The exponential probability distribution P(r) oc 
exp(— r/T) is a monotonous function of r with the most 
probable income r = 0. The probability densities shown in 
Fig. agree reasonably well with this simple exponential 
law. However, a number of other studies found a non- 
monotonous P(r) with a maximum at r ^ and P(0) = . 
These data were fitted by the log-normal 1 10(1 1 1 it Il02ll 
or the gamma distribution 1 1 1 3l ll 14t Il20ll . The origin of 
the discrepancy in the low-income data between our work 
and other papers is not completely clear at this moment. 
The following factors may possibly play a role. First, one 
should be careful to distinguish between personal income 
and group income, such as family and household income. As 
Eq. j47l > shows, the latter is given by the gamma distribution 
even when the personal income distribution is exponential. 
Very often statistical data are given for households and mix 
individual and group income distributions (see more discus- 
sion in Ref. Il04ln . Second, the data from tax agencies and 
census bureaus may differ. The former data are obtained from 
tax declarations of all the taxable population, whereas the 
latter data from questionnaire surveys of a limited sample of 
population. These two methodologies may produce different 
results, particularly for low incomes. Third, it is necessary t o 
disting uish betwe en distributions of money 1 103. Il20l Il2lll . 
wealth 11 14tll22ll . and income. They are, presumably, closely 
related, but may be different in some respects. Fourth, the 
low-income probability density may be different in the USA 
and in other countries because of different Social Security 
or more general policies. All these questions require careful 
investigation in future work. We can only say that the data 
sets analyzed in this paper and our previous papers are well 
described by a simple exponential function for the whole 
lower class. This does not exclu de a possibility that other 
functions can also fit the data 1 123]. However, the exponential 
law has only one fitting parameter T, whereas log-normal, 
gamma, and other distributions have two or more fitting 
parameters, so they are less parsimonious. 
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